25. Number of Features and Overfitting
Question:
A classic way to overfit an algorithm is by using lots of features and not a lot of training data. You can find the starter code in feature_selection/find_signature.py. Get a decision tree up and training on the training data, and print out the accuracy.
How many training points are there, according to the starter code?
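To make the overfitting setup concrete, here is a minimal sketch that does not use the starter code or its data. It assumes synthetic random features and labels, and the sizes (100 training points, 5000 features) are arbitrary illustration values, not the ones in find_signature.py. It shows how to count the training points, fit a decision tree, and compare accuracy on the training data against accuracy on held-out data.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(42)

# Hypothetical stand-in data: few points, many features, labels that are pure noise.
# These sizes are NOT taken from the starter code.
n_train, n_test, n_features = 100, 500, 5000
features_train = rng.rand(n_train, n_features)
labels_train = rng.randint(0, 2, size=n_train)
features_test = rng.rand(n_test, n_features)
labels_test = rng.randint(0, 2, size=n_test)

# The number of training points is the number of rows in the training features.
n_training_points = len(features_train)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(features_train, labels_train)

train_accuracy = accuracy_score(labels_train, clf.predict(features_train))
test_accuracy = accuracy_score(labels_test, clf.predict(features_test))

print("training points:", n_training_points)
print("training accuracy:", train_accuracy)
print("test accuracy:", test_accuracy)
```

With far more features than training points, an unconstrained tree can memorize noise labels on the training set while doing much worse on unseen data. In the starter code, the same `len(features_train)` check applies once the features have been loaded.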

INSTRUCTOR NOTE:
Special Note: Depending on when you downloaded the code provided for find_signature.py, you may need to change the code in lines 9-10 to be
words_file = "../text_learning/your_word_data.pkl"
authors_file = "../text_learning/your_email_authors.pkl"
so that the files created from running vectorize_text.py are reflected properly.
In addition, if the code fails to run because of memory issues and you are on version 0.16.x of scikit-learn, you can remove the .toarray() call from the line where features_train is created to save memory. In that version, the decision tree classifier can take a sparse array as input instead of only dense arrays.
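The sparse-versus-dense point can be sketched as follows. This is an illustration only: it assumes features_train comes from a text vectorizer such as scikit-learn's TfidfVectorizer (the starter code may differ), and the four short documents and their labels are made up for the example.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy corpus and labels, not the email data from the course.
docs = [
    "meeting schedule budget report",
    "dinner weekend family trip",
    "budget forecast quarterly report",
    "family birthday party weekend",
]
labels = [0, 1, 0, 1]

vectorizer = TfidfVectorizer()

# fit_transform returns a SciPy sparse matrix; only nonzero entries are stored.
features_sparse = vectorizer.fit_transform(docs)

# .toarray() materializes every entry, zeros included, which costs more memory.
features_dense = features_sparse.toarray()

# The decision tree accepts either representation (scikit-learn 0.16 and later).
clf_sparse = DecisionTreeClassifier(random_state=0).fit(features_sparse, labels)
clf_dense = DecisionTreeClassifier(random_state=0).fit(features_dense, labels)

print("sparse input:", issparse(features_sparse))
print("sparse-trained predictions:", clf_sparse.predict(features_sparse))
print("dense-trained predictions:", clf_dense.predict(features_dense))
```

On a large vocabulary most entries of the matrix are zero, so skipping the dense conversion is where the memory saving comes from.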